This commit introduces two new metadata fields:
- apicall_count: total count of all API calls made in the sample
- import_count: total count of Import symbols in the sample
Note: flake8 raises an error when rutils.warn() is used, so this uses rutils.bold() for now.
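As a rough illustration of what the two counters measure, here is a minimal sketch. The `Feature` class and `count_features` helper are hypothetical stand-ins invented for this example; they are not capa's actual feature types or the code in this PR.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in for an extracted feature (not capa's real class)."""

    kind: str  # e.g. "api", "import", "string"
    value: str


def count_features(features: Iterable[Feature]) -> Dict[str, int]:
    """Tally API-call and Import features into the two proposed metadata fields."""
    counts = {"apicall_count": 0, "import_count": 0}
    for feature in features:
        if feature.kind == "api":
            counts["apicall_count"] += 1
        elif feature.kind == "import":
            counts["import_count"] += 1
    return counts


sample = [
    Feature("api", "CreateFileA"),
    Feature("api", "WriteFile"),
    Feature("import", "kernel32.CreateFileA"),
    Feature("string", "hello"),
]
metadata = count_features(sample)
```

Features of any other kind (the string above) are ignored by both counters.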
mr-tz
left a comment
looks good, pending the tests to succeed
I think this requires regenerating the files in
Should be good to go once mandiant/capa-testfiles#239 is merged.
  def find_code_capabilities(
      ruleset: RuleSet, extractor: StaticFeatureExtractor, fh: FunctionHandle
- ) -> Tuple[MatchResults, MatchResults, MatchResults, int]:
+ ) -> Tuple[MatchResults, MatchResults, MatchResults, FeatureSet]:
changing the signature of a function is a breaking change, so this should wait until the next major release.
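One common way to avoid this kind of break is to keep the old function and its return type, and expose the richer result through a new function. The sketch below only illustrates that pattern: `FeatureSet` is a simplified stand-in type, the function bodies are placeholders, the name `find_code_capabilities_with_features` is hypothetical, and treating the old integer as `len(features)` is an assumption. None of this is capa's implementation.

```python
from typing import Dict, List, Set, Tuple

# Simplified stand-in: feature name -> addresses where it was seen (assumption).
FeatureSet = Dict[str, Set[int]]


def find_code_capabilities_with_features(
    function_features: FeatureSet,
) -> Tuple[List[str], FeatureSet]:
    """New entry point (hypothetical name) that also returns the features."""
    # Placeholder "matching": report every feature seen at least once.
    matches = sorted(name for name, addrs in function_features.items() if addrs)
    return matches, function_features


def find_code_capabilities(function_features: FeatureSet) -> Tuple[List[str], int]:
    """Old entry point: same return shape as before, so existing callers keep working."""
    matches, features = find_code_capabilities_with_features(function_features)
    return matches, len(features)


features: FeatureSet = {"api(CreateFileA)": {0x401000}, "string(hello)": set()}
old_style = find_code_capabilities(features)
new_style = find_code_capabilities_with_features(features)
```

Callers that need the features opt in to the new function; the old signature can then be retired at the next major release.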
feature_counts.file = feature_count

# cumulatively count the total number of Import features
for feature, _ in file_features.items():
use .keys() here to indicate that you won't use the value
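A small sketch of the suggested change, using a made-up `file_features` dictionary keyed by strings (the real keys in capa are feature objects, so the `startswith` check is only an assumption for this example):

```python
# Hypothetical data: feature name -> set of addresses where it occurs.
file_features = {
    "import(kernel32.CreateFileA)": {0x1000},
    "import(kernel32.WriteFile)": {0x1008},
    "string(hello)": {0x2000},
}

import_count = 0
# .keys() makes it explicit that the addresses (the values) are not used.
for feature in file_features.keys():
    if feature.startswith("import("):
        import_count += 1
```

Iterating the dictionary directly (`for feature in file_features:`) behaves the same way; `.keys()` just states the intent more visibly.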
tabulate.PRESERVE_WHITESPACE = True

MIN_LIBFUNCS_RATIO = 0.4
MIN_API_CALLS = 10
where did these numbers come from? and how should i interpret them?
MIN_LIBFUNCS_RATIO: When fewer than 40% of the functions in a sample are identified as library functions, we inform users that capa might pick up false positive matches from functions that would otherwise have been classified as library functions. I don't have any statistical data to back this up other than this hex-rays blogpost.
MIN_API_CALLS: When the sample has very few API calls, it is a strong indication that it might be packed/encrypted, as regular programs tend to make a lot more than 10 calls (though we have to run a benchmark across multiple samples to decide what's a good number here). For example, this packed capa-testfile emits 0 API features; luckily we detect that it is packed with UPX. If that weren't the case, this banner could serve as an indication that the sample might be packed.
good explanations!
would you include the key parts here as a comment?
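A commented version of the two constants, along the lines requested, might look like the sketch below. The `collect_warnings` helper, its parameters, and the message wording are hypothetical and only show how the thresholds could be applied; they are not taken from the PR.

```python
from typing import List

# Below this share of identified library functions, capa may report false
# positives from code that is really (unrecognized) library code.
# Heuristic value, not backed by benchmark data.
MIN_LIBFUNCS_RATIO = 0.4

# Regular programs tend to make many more API calls than this; very few
# calls hint that the sample may be packed/encrypted.
# Heuristic value, pending a benchmark across multiple samples.
MIN_API_CALLS = 10


def collect_warnings(
    library_functions: int, total_functions: int, apicall_count: int
) -> List[str]:
    """Return the banner messages that apply to a sample (hypothetical helper)."""
    warnings = []
    # Guard against division by zero when no functions were recovered.
    if total_functions > 0:
        ratio = library_functions / total_functions
        if ratio < MIN_LIBFUNCS_RATIO:
            warnings.append("few library functions identified; expect false positives")
    if apicall_count < MIN_API_CALLS:
        warnings.append("very few API calls; sample may be packed")
    return warnings
```

For instance, a sample with 10 of 100 functions identified as library code and 0 API calls would trigger both messages, while one with 50 of 100 and 25 API calls would trigger none.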
also i'm interested to see how frequently this message is shown to users. I don't think our sigs will identify 40% of functions in most binaries, so i'm a little concerned this message will be shown too often.
have you had a chance to collect these stats against a large number of samples?
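One way to answer this question would be to compute, over a corpus, the fraction of samples that fall below the ratio threshold. The sketch below assumes per-sample `(library_functions, total_functions)` pairs have already been collected; the `warning_rate` helper and the numbers are made up for illustration and are not measurements.

```python
from typing import Iterable, Tuple


def warning_rate(
    samples: Iterable[Tuple[int, int]], min_ratio: float = 0.4
) -> float:
    """Fraction of samples whose library-function ratio is below min_ratio.

    Each sample is a (library_functions, total_functions) pair.
    Samples with no recovered functions are skipped.
    """
    flagged = 0
    counted = 0
    for library_functions, total_functions in samples:
        if total_functions == 0:
            continue
        counted += 1
        if library_functions / total_functions < min_ratio:
            flagged += 1
    return flagged / counted if counted else 0.0


# Made-up corpus, for illustration only.
corpus = [(5, 100), (50, 100), (40, 100), (0, 10)]
rate = warning_rate(corpus)
```

Running the same computation with a few different `min_ratio` values would show how sensitive the banner is to the chosen threshold.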
I think it's still helpful information since we know there's most likely more library code than we've identified.
Stepping back here for a moment, let's consider if we want to implement this differently:
That way we can handle the various limitations/warnings consistently. The core extraction logic still resides in capa, but we don't have to extend the metadata. Related: should we provide functionality that makes this easier to leverage in other tools? Right now other tools need to reimplement the logic we have in
@mr-tz this would require many fewer breaking changes, which i like
Closing since this went stale and is fairly outdated.
Closes #857.
This commit introduces two new metadata fields to result_document. Would this be considered a breaking change?
This would require regenerating the rdoc test files. See mandiant/capa-testfiles#239.